National Addresses
Background
The United States is one giant network of roads connecting residential and commercial properties. Millions of people every day driving personal vehicles from one location to another. A to B. Today very little thought goes in to the actual mapping of this route as A and B are searchable through network-connected applications with significant algorithmic backends capable of finding the fastest route from A to B in milliseconds. A and B no longer need to be easily searchable in an atlas or organized in an intuitive numbering system. Without these constraints, how have we organized the world around us through addresses? Are there patterns or hidden stories in the way we name where we live and work?
Data is compiled from the National Address Database which currently contains over 78 million entries across more than 36 states. Unfortunately, not all data in select regions is publicly tabulated or available. Please see the map produced by the USDOT which details
which counties have provided data for this dataset. I encourage you to read through the database site; however if you don't, I think they concisely summarize exactly why such a database is important to maintain:
"Accurate and up-to-date addresses are critical to transportation safety and are a vital part of Next Generation 9-1-1. They are also essential for a broad range of government services, including mail delivery, permitting, and school siting..."
Data Processing
The first hurdle we encounter is that the csv (comma separated values) database download of choice is a nearly 30 GB large. 30 GB of text is a monstrous amount of text. Loading and manipulating this amount of data without specialized processes or intentional system design will be very slow. Safe to say the original file will not be contained within
the project's regular GitHub repository.
Before plotting anything, we should can remove excess data that is not informative for us. The dataset contains 60 columns of data for all 78,078,347 individual addresses. This includes some duplicate data like the "StNam_Full" column which contains all of the data from the columns:
eight additional columns. For instance, East 10th Street is listed as StNam_Full, but it has St_PreDir of "East", St_Name of "10th", and St_PosTyp of "Street".
After this processing is complete, the new filesize is only 6.1 GB. Still very large but a lot more manageable.
Another cleaning technique used was that any data entry with fewer than 60 sections of information was immediately excluded. This cut our number of entries from 78 million to 56 million.
Therefore, roughly 1/4 of the data provided was formatted incorrectly. Alternate dataset cleaning techniques could be used to attempt to map these rows to assumed columns, but given that the dataset already contains a subset of all overall addresses, excluding a few more
which were not properly formatted should not significantly change trends.
Street Names
Streets (S) and Avenues (A) are abbreviated in the following. (S/A) indicates both street and avenue versions are present in the top indices. In the 55 million or so addresses in the dataset, they are described by 1,133,644 unqiue street names.
- Main S
- Central Avenue
- Center S
- North S
- (1, 2, 3, 4, 5, 6, 7) S
- South S
- West S
- North 7th S
- West (2, 3, 4, 5) S
- East (2, 3, 4) S
- 5th A
- (E/W/NA) Washington S/A
- Lincoln S/A
- Madison S/A
- Franklin S
- Jefferson S/A
- Jackson S
- Adams S
- Elm S
- Oak S
- Maple S/A
- Walnut S
- Pine S
- Chestnut S
- Cedar S
- Forest A
- Park S/A
- River Road
- Ridge Road
- Riverside Drive
- Atlantic A
- Lake S
- Lakeview Drive
- Meadow Lane
- Airport Road
- Church S
- Mill S
- Market S
- High S
- Broadway
- State S
- Grand A
- Highland S
- Spring S
- Sunset Drive
- Broad S
- Broadway S
- Union S
- Circle Drive
- Front S
- Locust S
- Prospect S
Total counties = 1592 0, 25, 50, 75, 100% for number of addresses per county/state: 1, 4423, 11964, 30119, 1877615, mean = 35003 n = 2527418 0, 25, 50, 75, 100% for number of addresses per street name/county/state: 1, 3, 8, 21, 10613 , mean = 22.05
Let's now take a look at the most populous streets in this database. To make this distinction, we will sum every address per single road in each county. The results are shown in Figure C . Some of these results are relatively surprising to me.
I would have assumed the figure would be dominated by Manhatten streets as the tall buildings would contain a large number of addresses. However, it appears as though no Manhattan streets are on the top 25 (The NY Main street is a very long thoroughfare in Buffalo, not NYC).
Instead of being NY-dominated, this is a good representation of some of the longest continuous roads in major urban centers.
Places 1, 3, 7, and 17 are streets that cut through almost the entirety of Washington DC and are lined by large multi-unit buildings. 14/25 are located in Maricopa County, AZ (Phoenix Metro Area). Phoenix is notorious for its grid layout and these major roads run uninterrupted through an area home to over 5 million residents. High Street in OH is the principal road of Columbus, as is Congress Street in Portland Maine. Atlantic Avenue in Virginia Beach is 3 miles of condominiums, and Old Hickory Avenue encircles Nashville TN. It even has its own Wikipedia page. Rfd in Lake County, IL is a little bit of a mystery to me. It could be a transcription or data entry error, or it could be actual addresses designated as a byproduct of an old USPS service: Rural Free Delivery.
Another fun gimmick we can look at is the length of street names - mainly the longest ones. The first thing I discovered is that Ohio has a tendency to be obnoxious... After a little additional processing to remove Interstates, Highways, and "United States Routes", Figure D is the result.
Had we kept some of these keywords, the figure would be dominated by Ohio's circumlocutory interstates and underpasses like "Interstate 270 South And United States Route 33 Underpass". In fact, out of the 250 "true" longest street names in the dataset, 187 are in Franklin County, OH, and 211 are in OH (~85%).
My personal favorite is East Morgan Ave - East Of Taco Bell Avenue.
Street Suffixes
One curiosity gained while looking at the most common street names in this data set is the decision regarding street suffix, e.g., Street, Road, Lane, Boulevard, etc. Why is Dogwood Lane more common than Dogwood Drive which is more common than Dogwood Street or Avenue? How are these decided? For many I'm not sure if we will ever have a definitive answer. However, I have some couple plausible theories. I speculate that certain road names become more common and that contributes to a snowball effect where a name sounds better in the way it has been written before rather than deviate and create something new. Alternatively, the same underlying subconcious around naming might be consistent across time & geography contributing to independent "namers" converging on the same name. For example: River Road feels more natural than River Street or River Boulevard. There are pieces online that give more historical & utilitarian meaning for each suffix: Steamboat Agent & Visual Guide to Roads.
Based on the resources, it seems that different regions may have different suffix prevalencies depending on original or current land usage. The following plot shows the 8 most common suffixes (by counties containing at least 1 road with that suffix) across the NAD geographic region. Please use the button in the top left to select between suffix and scroll or move around as needed. Decimal heatmap data represents the decimal proportion of roads within that county which use that suffix. For example, St Louis County, MN (very North) has 4901 registered roads in the database, roughly 40% of them are terminated by "Road". If that value were > 50% it would stay the same shade of blue for all values. Some counties are either very red or very blue - often because they only have 1 registered road in this database and that road either matches (1) the selected suffix or doesn't (0).
Numbering
Street names are only one portion of our overall address, let's now look at the 110737 unique house numbers. Are there any trends you might be able to predict before looking at figures?
In Figure E we can see the occurence frequency for house numbers 1-99. Looking at this data a few observations stand out:
Let's extend this data through 499 and see if the trends continue.
Yep. The trend does continue for the most part with the obvious addition that with every 100 numbers the frequency jumps up again. Why might this happen? Are communities starting house numbers at 300 and counting up or incrementing by 5? Hard to be certain from the data alone, but one hypothesis is that just as communities prefer to number in multiples of 5 and 10, they are even more preferential toward multiples of 100. This would mean there are a lot of extra zeros on envelopes shipped around the US. Let me know if you have other theories about why these patterns exist in our addresses.